[Core] add client side health-check to detect network failures. #31640
Conversation
Should we also do this here? (ray/src/ray/rpc/gcs_server/gcs_rpc_client.h, line 195 in bc3114d)
Is there a way to test this?
And how about the Python client's configs?
cc @shomilj
Requesting changes until Ricky's comments are addressed!
We may need to change the config from the Python side too. I am not sure whether we have any gRPC clients other than https://github.com/ray-project/ray/blob/master/python/ray/_private/gcs_pubsub.py. Maybe we should aggregate all gRPC client usage into a single file so the global config applies to Python as well?
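For illustration, a shared helper along those lines might look like the sketch below. The `create_channel` function and the numeric values are hypothetical and not part of this PR; only the gRPC channel-argument names and the `grpc.insecure_channel` API are standard.

```python
# Hypothetical sketch of a single shared module for Python gRPC clients
# (e.g. gcs_pubsub), so that keepalive / health-check options are defined once.
# The option values below are illustrative, not the values used by this PR.
import grpc

_COMMON_CHANNEL_OPTIONS = [
    ("grpc.keepalive_time_ms", 60000),           # send a keepalive ping every 60 s
    ("grpc.keepalive_timeout_ms", 30000),        # treat the channel as dead if no ack in 30 s
    ("grpc.keepalive_permit_without_calls", 1),  # ping even when no RPC is in flight
    ("grpc.http2.max_pings_without_data", 0),    # do not limit pings without payload data
]


def create_channel(address: str) -> grpc.Channel:
    """Create an insecure channel with the shared client-side options."""
    return grpc.insecure_channel(address, options=_COMMON_CHANNEL_OPTIONS)
```

A client such as gcs_pubsub would then call `create_channel(...)` instead of building its own channel, so a future change to the health-check settings only touches one place.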
Hmm, the challenging part is simulating a failure where we terminate the node without sending a FIN on the TCP connection.
I tried both rebooting and preempting spot instances while a job was running; Ray was able to detect the node failure in both cases.
Why are these changes needed?
Occasionally Ray users have seen ray.get hanging when the node executing the task that ray.get is waiting for is preempted and disconnected from the cluster. As we debugged one instance of such a hang, we found that it was caused by the underlying gRPC channel failing to detect this network failure.
To solve this problem, we need to add some sort of health check at the OS level (TCP keepalive), the RPC level (gRPC), or the application level (Ray). Since configuring TCP keepalive through gRPC does not seem easy, and the Ray-level approach would involve changing a lot of code, this PR makes the change at the gRPC level.
Also note that in Ray we treat a network failure as a component failure, so we use a relatively loose timeout to reduce false positives.
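For comparison with the OS-level option mentioned above: enabling TCP keepalive requires access to the underlying socket, which the gRPC channel API does not readily expose. The sketch below is not part of this PR; the numbers are illustrative and the TCP_KEEP* options are Linux-specific.

```python
# Sketch of OS-level TCP keepalive on a raw socket, for comparison only.
# gRPC channels do not hand out their underlying sockets, which is part of why
# the health check here is configured through gRPC channel arguments instead.
import socket


def enable_tcp_keepalive(sock: socket.socket) -> None:
    """Turn on kernel keepalive probes for an already-connected socket (Linux)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # seconds idle before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before reset
```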
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.